Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module/igv/1.0 #303

Open
wants to merge 137 commits into
base: master
Choose a base branch
from
Open

Module/igv/1.0 #303

wants to merge 137 commits into from

Conversation

mannycruz
Copy link
Contributor

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for New Module

Required

  • I used the cookiecutter template and updated the placeholder rules.

  • The snakemake rules follow the design guidelines.

    • All references to the rules object (e.g. for input files) are wrapped with str().
  • Every rule in the module is either listed under localrules or has the threads and resources directives.

  • Input and output files are being symlinked into the CFG["inputs"] and CFG["outputs"] subdirectories, respectively.

  • I grouped the input symlinking rule to the next job that uses the input files.

  • I updated the final target rule (*_all) to include every output rule.

  • I explained important module design decisions in CHANGELOG.md.

  • I tested the module on real data for all supported seq_type values.

  • I updated the default.yaml configuration file to provide default values for each rule in the module snakefile.

  • I did not set any global wildcard constraints. Any/all wildcard constraints are set on a per-rule basis.

  • I ensured that all symbolic links are relative and self-contained (i.e. do not point outside of the repository).

  • I replaced every value that should (or might need to) be updated in the default configuration file with __UPDATE__.

  • I recursively searched for all comments containing TODO to ensure none were left. For example:

    grep -r TODO modules/<module_name>/1.0

If applicable

  • I added more granular output subdirectories.

  • I added rules to the reference_files workflow to generate any new reference files.

  • I added subdirectories with large intermediate files to the list of scratch_subdirectories in the default.yaml configuration file.

  • I updated the list of available wildcards for the input files in the default.yaml configuration file.

Checklist for Updated Module

Important! If you are updating the module version, ensure the previous version of the module is restored from master.
If you want to restore a deleted file or directory from the remote master, you can use git checkout origin/master path/to/file,
then a git commit will ensure that file is tracked on your branch again.
Example:

mv modules/strelka/1.1 modules/strelka/1.2
git checkout origin/master modules/strelka/1.1

igv:

inputs:
# Available wildcards: {unix_group} {seq_type} {tumour_sample_id} {normal_sample_id} {pair_status} {genome_build}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay to have {unix_group} as a wildcard?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a standard column in the lcr-modules schema and is something specific to gambl. There are ways to deal with this - can show some examples and we can talk about this more before lab meeting today 😄

@Kdreval Kdreval self-requested a review March 28, 2024 17:43
@Kdreval Kdreval self-assigned this Mar 28, 2024
Copy link
Collaborator

@Kdreval Kdreval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!
I am mostly wondering if we can reuse existing conda envs and scripts to have more standard approach and decrease the maintenance burden 😄
This looks great, long way since the early version! 🚀

maf: "__UPDATE__"

regions:
# Provide regions files as lists in their respective genome builds so that liftover of coordinates occurs properly
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if nothing is specified here? Can we add the "__UPDATE__" to anything that has to be filled in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an "__UPDATE__" string and specified that at least one regions file must be provided

maf:
grch37: []
hg38: []
mutation_id:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add here the example formatting of what is expected in that file? Does it have to have a header and expects certain column names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ! I added what column and format is required for mutation_id file format

options:
genome_map:
# Map metadata builds to grch37 and hg38 so that MAF file locations are determined correctly
grch37: ["__UPDATE__"] # e.g ["grch37","hg19","hs37d5"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you sort of know this upfront, so maybe it is better to fill in these keys and then the comment will be that you can add other genome builds to this list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ✅


liftover_regions:
reference_chain_file:
grch37: "genomes/grch37/chains/grch37/hg19ToHg38.over.chain"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is generated with the reference files so you can directly request the reference files output instead of having this listed here. This can be an example:

def get_chain(wildcards):
if "38" in str({wildcards.genome_build}):
return reference_files("genomes/{genome_build}/chains/grch38/hg38ToHg19.over.chain")
else:
return reference_files("genomes/{genome_build}/chains/grch37/hg19ToHg38.over.chain")
# Convert the bed file in hg38 coordinates into hg19 coordinates
rule _run_liftover:
input:
native = rules._liftover_convert_2_bed.output.bed,
chains = get_chain
output:
lifted = temp(CFG["dirs"]["liftover"] + "from--{seq_type}--{genome_build}/{tumour_id}--{normal_id}--{pair_status}.{tool}--{type}.lifted_{chain}.bed"),
unmapped = CFG["dirs"]["liftover"] + "from--{seq_type}--{genome_build}/{tumour_id}--{normal_id}--{pair_status}.{tool}--{type}.lifted_{chain}.unmapped.bed"
log:
stderr = CFG["logs"]["liftover"] + "from--{seq_type}--{genome_build}/{tumour_id}--{normal_id}--{pair_status}.{tool}--{type}.lifted_{chain}.stderr.log"
params:
mismatch = CFG["options"]["min_mismatch"]
conda:
CFG["conda_envs"]["liftover-366"]
wildcard_constraints:
chain = "hg38ToHg19|hg19ToHg38"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ✅

run_unpaired_tumours_with: "unmatched_normal"
run_paired_tumours_as_unpaired: False

slms_3:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you instead just include config from slms-3?
I think this will conflict if you have snakefile that will attempt to run both slms-3 and this module because snakemake won't like duplicated config values - or will just use whatever is imported last so may have unexpected consequence in that scenario

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good idea, I removed this part from the config

bam = CFG["dirs"]["inputs"] + "bams/{seq_type}/{sample_id}.bam",
bai = CFG["dirs"]["inputs"] + "bams/{seq_type}/{sample_id}.bam.bai"
run:
op.absolute_symlink(input.bam, output.bam)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if this needs crai index whe the bam is a cram 😃

input:
maf = get_maf
output:
maf = CFG["dirs"]["inputs"] + "maf/{seq_type}--{genome_build}/{tumour_id}--{normal_sample_id}--{pair_status}.maf"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be missing the sample in the tumour_id wildcard?
Here is the part from config:

# Available wildcards: {unix_group} {seq_type} {tumour_sample_id} {normal_sample_id} {pair_status} {genome_build}

script:
config["lcr-modules"]["igv"]["scripts"]["format_regions"]

REGIONS_FORMAT = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a dictionary if everything is mapped to the same key bed? Maybe we can just refer to bed regardless of what is the region format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right- I removed it

script:
config["lcr-modules"]["igv"]["scripts"]["filter_script"]

def _get_maf(wildcards):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this does not conflict with the function defined above?

def get_maf(wildcards):
    unix_group = config["unix_group"]
    return expand(config["lcr-modules"]["igv"]["inputs"]["maf"], allow_missing=True, unix_group=unix_group)

Just because it is defined later the further call will use the "latest" definition?

Copy link
Contributor Author

@mannycruz mannycruz Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the earlier function because I do not need it to fill in unix_group value (this can be done from the snakefile that is used to launch the run), so not an issue anymore


rule _igv_download_igv:
output:
igv_zip = CFG["dirs"]["igv"] + "IGV_2.7.2.zip",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone doesn't run this on linux or wants to have a different version? Maybe this part should be configurable or better even use the conda instead of downloading the source file?
https://anaconda.org/bioconda/igv

@mannycruz
Copy link
Contributor Author

I have addressed most of the comments above, but I need to further investigate regarding:

plus i have made added some more changes since, so I will ask for an updated review once I'm finished

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants